<%
=begin
apps: mxnet
platforms: kubernetes, tanzu-application-catalog
id: distributed_training_example
title: Distributed training example
category: administration
weight: 50
=end %>

We will use the [gluon example from the Apache MXNet (Incubating) official repository](https://github.com/apache/incubator-mxnet/tree/master/example/gluon). Launch it with the following values:

~~~
mode=distributed
cloneFilesFromGit.enabled=true
cloneFilesFromGit.repository=https://github.com/apache/incubator-mxnet.git
cloneFilesFromGit.revision=master
entrypoint.file=image_classification.py
entrypoint.args="--dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync"
entrypoint.workDir=/app/example/gluon/
~~~

Check the logs of the worker node:

~~~
INFO:root:Starting new image-classification task:, Namespace(batch_norm=False, batch_size=32, builtin_profiler=0, data_dir='', dataset='cifar10', dtype='float32', epochs=1, gpus='', kvstore='dist_sync', log_interval=50, lr=0.1, lr_factor=0.1, lr_steps='30,60,90', mode=None, model='vgg11', momentum=0.9, num_workers=4, prefix='', profile=False, resume='', save_frequency=10, seed=123, start_epoch=0, use_pretrained=False, use_thumbnail=False, wd=0.0001)
INFO:root:downloaded http://data.mxnet.io/mxnet/data/cifar10.zip into data/cifar10.zip successfully
[10:05:40] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/train.rec, use 1 threads for decoding..
[10:05:45] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/test.rec, use 1 threads for decoding..
~~~

If you want to increase the verbosity, set the environment variable PS_VERBOSE=1 or PS_VERBOSE=2 using the commonEnvVars value.

~~~
mode=distributed
cloneFilesFromGit.enabled=true
cloneFilesFromGit.repository=https://github.com/apache/incubator-mxnet.git
cloneFilesFromGit.revision=master
entrypoint.file=image_classification.py
entrypoint.args="--dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync"
entrypoint.workDir=/app/example/gluon/
commonExtraEnvVars[0].name=PS_VERBOSE
commonExtraEnvVars[0].value=1
~~~

You will now see log entries in the scheduler and server nodes.

~~~
[14:22:44] src/van.cc:290: Bind to role=scheduler, id=1, ip=10.32.0.11, port=9092, is_recovery=0
[14:22:53] src/van.cc:56: assign rank=9 to node role=worker, ip=10.32.0.17, port=55423, is_recovery=0
[14:22:53] src/van.cc:56: assign rank=11 to node role=worker, ip=10.32.0.16, port=60779, is_recovery=0
[14:22:53] src/van.cc:56: assign rank=13 to node role=worker, ip=10.32.0.15, port=39817, is_recovery=0
[14:22:53] src/van.cc:56: assign rank=15 to node role=worker, ip=10.32.0.14, port=48119, is_recovery=0
[14:22:53] src/van.cc:56: assign rank=8 to node role=server, ip=10.32.0.13, port=56713, is_recovery=0
[14:22:53] src/van.cc:56: assign rank=10 to node role=server, ip=10.32.0.12, port=57099, is_recovery=0
[14:22:53] src/van.cc:83: the scheduler is connected to 4 workers and 2 servers
[14:22:53] src/van.cc:183: Barrier count for 7 : 1
[14:22:53] src/van.cc:183: Barrier count for 7 : 2
[14:22:53] src/van.cc:183: Barrier count for 7 : 3
[14:22:53] src/van.cc:183: Barrier count for 7 : 4
...
~~~
